Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information
Abstract
Covering ambiguity is one of the two basic types of ambiguity in Chinese word segmentation. We treat its resolution as equivalent to word sense disambiguation and use the classical vector space model from information retrieval to represent the contexts of ambiguous words. We propose a variant of TFIDF weighting and additionally use a Chinese thesaurus to cope with the data sparseness problem. We select 90 frequent cases of covering ambiguity as targets. The training set contains 77654 sentences and the test set contains 19242 sentences. Experimental results show that our model achieves 96.58% accuracy, outperforming both the original form of TFIDF weighting and a second baseline, the hidden Markov model.
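The general idea of treating a covering ambiguity as a two-way classification over context vectors can be sketched as follows. This is only an illustration under stated assumptions: it uses a standard smoothed TFIDF weight, not the variant proposed in the paper, it omits the thesaurus, and the toy contexts for the string 马上 ("immediately" as one word, or 马 + 上 "on the horse" as two) are invented for the example. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def idf_table(contexts):
    """Smoothed IDF over a list of context-word lists (standard form, not the paper's variant)."""
    n = len(contexts)
    df = Counter()
    for ctx in contexts:
        df.update(set(ctx))  # count each word once per context
    return {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}

def tfidf_vector(ctx, idf):
    """Sparse TFIDF vector for one context; words unseen in training are dropped."""
    tf = Counter(ctx)
    return {w: c * idf[w] for w, c in tf.items() if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train(examples):
    """Build the IDF table and one summed context vector per segmentation label."""
    idf = idf_table([ctx for ctx, _ in examples])
    centroids = defaultdict(lambda: defaultdict(float))
    for ctx, label in examples:
        for w, x in tfidf_vector(ctx, idf).items():
            centroids[label][w] += x
    return idf, {label: dict(vec) for label, vec in centroids.items()}

def classify(ctx, idf, centroids):
    """Pick the label whose centroid is most similar to the context vector."""
    vec = tfidf_vector(ctx, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Invented toy data: context words around the ambiguous string, with its correct reading.
TRAIN = [
    (["我", "回来"], "whole"),        # 我 马上 回来
    (["他", "出发"], "whole"),        # 他 马上 出发
    (["骑", "在", "的", "人"], "split"),  # 骑 在 马 上 的 人
    (["坐", "在", "看"], "split"),     # 坐 在 马 上 看
]

idf, centroids = train(TRAIN)
print(classify(["你", "回来"], idf, centroids))
print(classify(["骑", "在"], idf, centroids))
```

A nearest-centroid rule is used here only because it is the simplest classifier over such vectors; the abstract does not say which decision rule the authors applied.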
Related resources
Ambiguity Resolution in Chinese Word Segmentation
A new method for Chinese word segmentation named Conditional F&BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (i.e., mutual information and difference of t-test between Chinese characters) and linguistic rules for ambiguity resolution is proposed in this paper. The key characteristics of this model are the use of: (i) statistics which can be automatically ...
Unknown Word Identification for Chinese Morphological Analysis
Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese texts becomes an essential task for Chinese language processing. Besides word segmentation, we also need to identify the part-of-speech (POS) tags of the words. The segmentation and POS tagging process are denoted as morphological analysis. During the process of word segmentation, two main problems o...
An integrated approach for Chinese word segmentation
This paper presents an integrated approach for Chinese word segmentation, which can perform disambiguation and unknown word identification simultaneously on the input. In this work, a hybrid model is used to score known word candidates and unknown word candidates equally by incorporating the modified word-formation models (viz. word-juncture models and word-formation patterns) into word bigram m...
Chinese Word Segmentation Based on Contextual Entropy
Chinese is written without word delimiters so word segmentation is generally considered a key step in processing Chinese texts. This paper presents a new statistical approach to segment Chinese sequences into words based on contextual entropy on both sides of a bigram. It is used to capture the dependency with the left and right contexts in which a bigram occurs. Our approach tries to segment b...
Research on Tibetan Automatic Word Segmentation
This paper studies Tibetan automatic word segmentation. We focus on three key technologies of Tibetan automatic word segmentation: (1) a Tibetan automatic word segmentation approach is proposed, which takes advantage of case-auxiliary words and continuous feature. (2) a resolution method for overlapping ambiguity in Tibetan word segmentation is proposed, which is based on forward-b...
Publication date: 2002